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Abstract 

We introduce the Hamming Ball Sampler, a novel Markov Chain Monte Carlo algorithm, 
for efficient inference in statistical models involving high-dimensional discrete state spaces. 
The sampling scheme uses an auxiliary variable construction that adaptively truncates the 
model space allowing iterative exploration of the full model space in polynomial time. The 
approach generalizes conventional Gibbs sampling schemes for discrete spaces and can be 
considered as a Big Data-enabled MCMC algorithm that provides an intuitive means for 
user-controlled balance between statistical efficiency and computational tractability. We 
illustrate the generic utility of our sampling algorithm through application to a range of 
statistical models. 


Statistical inference of high-dimensional discrete-valued vectors or matrices underpins many 
problems across a variety of applications including language modelling, genetics and image 
analysis. Bayesian approaches for such models typically rely on the use of Markov Chain Monte 
Carlo (MCMC) algorithms to simulate from the posterior distribution over these objects. The 
effective use of such techniques requires the specification of a suitable proposal distribution that 
allows the MCMC algorithm to fully explore the discrete state space whilst maintaining sampling 
efficiency. While there have been intense efforts to design optimal proposal distributions for 
continuous state spaces, generic approaches for high-dimensional discrete state models have 
received relatively less attention but some examples include the classic Swendsen-Wang algorithm 
[T] for Ising/Potts models and more recent Sequential Monte Carlo methods [2]. 

In this paper we consider Bayesian inference using MCMC for an unobserved latent discrete¬ 
valued discrete sequence or matrix X G A, where each element Xij G {!,..., S'}, given ob¬ 
servations y = [yi,... ,yi\f]. We will assume that the observations are conditionally indepen¬ 
dent given X and model parameters 9 so that the joint distribution factorizes as p(y, X, 6) = 

n^iP(yi|X, 0) p(X,0). We further assume that the posterior distribution p(X, 0|y) has a 
complex dependence structure so that standard MCMC schemes, such as a (Metropolis-within) 
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Gibbs Sampler, using 


0^p(0|X,y) 

X^p(X|0,y) 


( 1 ) 

( 2 ) 


or a marginal Metropolis-Hastings sampler over 6 based on 

0 ^ p{9\y) oc p{y, X, 6), (3) 

xeA” 

are both intractable because exhaustive summation over the entire state space of X has expo¬ 
nential complexity. 

A popular and tractable alternative is to employ block-conditional (Metropolis-within) Gibbs 
sampling in which subsets x* of X are updated conditional on other elements being fixed using 

9^p{e\X,y), (4) 

Xj ^ p(xj|X_i,6',y),Vi, (5) 

where X_j denotes the elements excluding those in Xj. Typical block structures might be 
rows/columns of X, when it is a matrix, or sub-blocks when X is a vector. Whilst block- 
conditional sampling approaches are often convenient (they may be of closed form allowing 
for Gibbs sampling without resort to Metropolis-Hastings steps), in high dimensions, major 
alterations to the configuration of X maybe difficult to achieve as this must be done via a 
succession of small (possibly low probability) incremental changes. Gonditional sampling may 
lead to an inability to escape from local modes in the posterior distribution particularly if the 
elements of X exhibit strong correlations with each other and together with 9. 

To address these problems, we propose a novel and generic MGMG sampling procedure 
for high-dimensional discrete-state models, named the “Hamming Ball Sampler". This sampling 
algorithm employs auxiliary variables that allow iterative sampling from slices of the model space. 
Marginalization within these model slices is computationally feasible and, by using sufficiently 
large slices, it is also possible to make significant changes to the configuration of X. The 
proposed sampling algorithm spans a spectrum of procedures that contains the marginal and 
block-conditional Gibbs sampling strategies as extremes. At the same time, it allows the user to 
express many more novel schemes so that to select the one that best balances statistical efficiency 
and computational tractability. In this sense, the Hamming Ball Sampler is an example of a Big 
Data-enahled MGMG algorithm. 

We demonstrate the utility of the sampling procedure with three different statistical models 
where exhaustive enumeration is impossible for realistic data sets and illustrate the considerable 
benefits over standard sampling approaches. 

1 Theory 

1.1 Construction 

The Hamming Ball Sampler considers an augmented joint probability model that can be fac¬ 
torized as p(y,X,0, U) = p(y, X, 0)p(U|X) where the extra factor p(U|X) is a conditional 
distribution over an auxiliary variable U which lives in the same space and has the same dimen¬ 
sions as X. The conditional distribution p(U|X) is chosen to be an uniform distribution over a 
neighborhood set l^m(X) centered at X, p(U|X) = ^I(U G 'Hm(X)), where I(-) denotes the 
indicator function and the normalizing constant Zm is the cardinality of T-LmO^)- 
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The neighborhood set T-LmO^) will be referred to as a Hamming Ball since it is defined 
through Hamming distances so that HmO^) = {U : d{ui, Xj) < m, i = 1, ..., P}. Here, d{xi,Ui) 
denotes the Hamming distance 7^ ^ij) pairs (uj,Xj) denote non-overlapping 

subsets of corresponding entries in (U, X) such that = U and U^^^Xj = X. Also, the 

parameter m denotes the maximal distance or radius of each individual Hamming Ball set. For 
instance, these pairs can correspond to different matrix columns so that Xj will be the i-th 
column of X and Uj the corresponding column of U. Hence, the Hamming Ball ?^m(X) would 
consist of all matrices whose columns are at most m elements different to X. 


1.2 Gibbs sampling 

The principle behind the Hamming Ball Sampler is that the use of Gibbs sampling for the 
augmented joint probability distribution p(y, X, 6, U) admits the target posterior distribution 
p(X, 0|y) as a by-product (since marginalization over U recovers the target distribution). Specif¬ 
ically, the Hamming Ball Sampler alternates between the steps: 

U^p(U|X), 

(0,X)^p(0,X|y,U) 

The update of {9, X) can be implemented as two conditional (Gibbs) updates: 

9^p{9\X,y), (8) 

X^p(X|0,U,y). (9) 

Or, alternatively, as a joint update via a Metropolis-Hastings accept-reject step that draws a 
new (0',X') from the proposal distribution (5(0^X'|0, X) = p(X'|0', U, y)g(0'|0) and accepts it 
with probability min ^1, v ’y)q{e'\e) ) ’ where q{9'\6) is a proposal distribution over the model 
parameters. 

The above algorithms consist of generalizations of the (Metropolis-within) Gibbs and marginal 
schemes outlined in (Q-(|^ and (|^ since the latter algorithms are obtained as special cases when 
the radius m becomes large enough (see SI A: Further details about the Hamming Ball Sampler). 


( 6 ) 

(7) 


1.3 Restricted state space 

Grucially, the restricted state space defined by the Hamming Ball, that has been injected into 
the model via the auxiliary factor p(U|X), means that the conditional distribution p(X.\9, U, y) 
can be tractably computed as 


p(X|0,U,y) 


p(y,X,g)I(X£H^(U)) 

p(6',U,y) 


( 10 ) 


where p{9, U, y) = X]x'e'Hm(U) Piy^ 9) is the normalizing constant found by exhaustive sum¬ 
mation over all admissible matrices inside the Hamming Ball ?^m(U). Through careful selection 
of m, the cardinality of l^m(U) will be considerably less than the cardinality of X so that exhaus¬ 
tive enumeration of all elements inside the Hamming Ball would be computationally feasible. 

Overall, the proposed construction uses the auxiliary variable U to define a slice of the model 
given by ^^(U). Sampling of (0,X) is performed within this sliced part of the model through 
p(0,X|y,U). At each iteration, this model slice randomly moves via the re-sampling of U in 
step (|^, which simply sets U to a random element from Hm(X). This re-sampling step allows 
for random exploration that is necessary to ensure that the overall sampling scheme is ergodic. 
The amount of exploration depends on the radius m so that U can differ from the current state 
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of the chain, say at most in mP elements, i.e. the maximum Hamming distance between 
U and is mP. Similarly, the subsequent step of drawing the new state, say is 

such that at maximum can differ from U in mP elements and overall it can differ from 

the previous X*-*^ at most in 2mP elements. From these observations we can conclude that 
a necessary condition for the algorithm to be ergodic is that m > 0. Figure [T}\ graphically 
illustrates the workings of the Hamming Ball Sampler. 

1.4 Selection of blocks 

The application of the Hamming Ball Sampler requires the selection of the subsets or blocks 
{xi,... ,xp}. This selection will depend on the conditional dependencies specihed by the sta¬ 
tistical model underlying the problem to be addressed. For some problems, such as the tumor 
deconvolution mixture model considered later, there may exist a natural choice for these subsets 
(e.g. columns of a matrix) that can lead to efficient implementations. For instance, under a 
suitable selection of blocks the posterior conditional p(X|0,U, y) could be fully factorized, i.e. 
p(X|0,U, y) = n^i y)) have a simple Markov dependence structure (as for the 

factorial hidden Markov model example considered later) so that exact simulation of X would 
be feasible. In contrast, for unstructured models, where X is just a large pool of fully dependent 
discrete variables (stored as a ZD-dimensional vector), we can divide the variables into randomly 
chosen blocks Xj, z = 1,... ,P, so that they have equal length K = length(xj). In such cases, 
exact simulation from p(X|0,U, y) may not be feasible and instead we can use the Hamming 
Ball operation to sequentially sample each block. More precisely, this variant of the algorithm 
can be based on the iteration with the only difference that the steps ^ and © are 

now split into P sequential conditional steps, 

Ui^p(uj|xj), Xj ^p(xi|X_i,6',Ui,y),Vz. (11) 

This scheme can be thought of as a block Hamming Ball Sampler which incorporates standard 
block Gibbs sampling (see iterations (i-©) as a special case obtained when the radius m is 
equal to the block size K. In a purely block Hamming Ball scheme we will have m < K and in 
general the parameters (m, K) can be used to jointly control algorithmic performance (see SI: 
Fig. 1). This scheme is illustrated in Figure]^ and used later in the regression application and 
discussed in further detail in SI B.l: Blocking strategies. 

1.5 Computational complexity 

To hnd the time complexity of the Hamming Ball Sampler we assume for simplicity that either 
p(X|0,U,y) factorizes across the blocks or we use the block scheme in ( |11[ ). Then, for P 
blocks of size K, the computational complexity of the Hamming Ball Sampler scales with the 
Hamming radius m, block size K and P according to 0{MP) where M = Yl'jLoiS — l)'^(^)- 
Contrast this with the block Gibbs Sampler which has computational complexity of 0{S^P) 
(where S^ = — l)-^ ('^)) and it is only applicable for small values of block size K. On 

the other hand, Hamming Ball sampling is more flexible since it can allow to use much larger 
block sizes by controlling the computational cost through both K and m. An ideal choice of 
the Hamming radius is m = iF/2 since, as discussed previously, it is possible to change 2m 
elements per Hamming Ball sampling update so with m = K/2 it is possible to update all K 
elements in each block in a single update. In the Results section we shall see this outcome 
empirically in actual applications, however, for large-scale problems, it may not be practical 
to use Hamming distances beyond m = 1,2,3. Our simulations will show that even in these 
circumstances the Hamming Ball Sampler can still be advantageous by providing the flexibility 
to balance statistical and computational efficiency. 
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1.6 Extensions 


A simple generalization of the algorithm is obtained by allowing block-wise varying Hamming 
maximal distances. If we assnme a varying radins for the individual Hamming Balls, then the 
conditional distribntion over U becomes nniform on the generalized Hamming Ball = 

{U : d{ui,Xi) < mi,i = where m = (mi,...,mp) denotes the set of maximal 

distances for each snbset of variables. Fnrthermore, we conld allow m, at each iteration, to be 
randomly drawn from a distribntion p(m) (see SI B.2: Randomness over the Hamming Ball 
radius for fnrther discnssion). 

Alternate anxiliary conditional distribntions p(U|X) are also permitted. For instance, a more 
general anxiliary distribntion can have the form p(U|X) oc exp(— A(i(U, X))I(U G l^m(X)), with 
A > 0, which for A > 0 is non-nniform and places more probability mass towards the center X 
(see SI B.3: Non-uniform auxiliary Hamming Ball distributions for fnrther discnssion). 

2 Results 

We now illnstrate the ntility of onr Hamming Ball sampling scheme throngh its application to 
three statistical models that involve high-dimensional discrete state spaces. We motivate the 
selection of each model throngh real scientihc problems. 

2.1 Tumor deconvolution through mixture modelling 

Tnmor samples are genetically heterogeneons and typically contain an nnknown nnmber of dis¬ 
tinct cell snb-popnlations. Cnrrent DNA seqnencing technologies ordinarily prodnce data that 
comes from an aggregation of these snb-popnlations thns, in order to gain insight into the la¬ 
tent genetic architectnre, statistical modelling mnst be applied to deconvolve and identify the 
constitnent cell popnlations and their mntation prohles. 

In order to tackle this problem, [3] and H] adopt a statistical framework in which the set 
of nnobserved mntation prohles can be described as a K x N binary matrix X, where Xki = 
0/1 denotes absence/presence of a mntation in the k-th. cell type for the i-th mntation, and 
parameters 9 = {9i,... ,9 k} denote the proportion of the tnmor attribnted to each of the K cell 
types. The observed seqnence data y depends on (9, X) throngh the mntant allele freqnencies 
(j) = |0^X and we wonld like to simnlate from the posterior distribntion P(0, X|y) and explore 
the possible conhgnrations of the mntation matrix X that are most compatible with the observed 

dataJ3 

We compared three posterior sampling approaches: (i) a conventional block Gibbs sampling 
strategy that proceeds by conditionally npdating one colnmn Xj at a time with the remaining 
colnmns X_j and the weights 9 hxed, (ii) a Hamming Ball-based sampling scheme where we 
dehne the Hamming Ball as the set of all matrices snch that each colnmn Xj is at most m 
bits different from the corresponding colnmn Uj of the anxiliary matrix U and (iii) a fully 
marginalized sampling strategy where X was marginalized throngh exhanstive snmmation over 
all colnmn conhgnrations (note, this corresponds to the Hamming Ball Sampler with m = K). 
Onr data examples was chosen to be snfficiently small so that the fnlly marginalized sampler 
was practical. We refer to Materials and Methods for detailed derivations and implementations 
of the following data examples. 

We considered a simnlated data example, illnstrated in Fignre 2A, where the observed se¬ 
qnence data is generated so that it can be eqnally explained by two different latent genetic 

^In |3] and [4] statistical inference was conducted using deterministic approaches and not Monte Carlo sam¬ 
pling. They use massive numbers of random initializations to overcome local modes in the posterior. As our 
interest is in full posterior characterization we do not compare with the point estimates given by these methods. 
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architectures. This is an interesting example as one conhguration corresponds to a linear phy¬ 
logenetic relationship between cell types and the other to a branched phylogeny and represent 
fundamentally different evolutionary pathways. For this example, we wonld expect an efficient 
sampler to identify both conhgurations and to be able to move freely between the two during 
the simulation revealing the possibility of the existence of dual physical explanations for the 
observed data. 

Figures 2B and C display trace plots of the largest component weight, max (6) and the relative 
computational times for the three sampling schemes (see also SI: Figs. 2-3). All Hamming 
Ball samplers were effective at identifying both modes but the efficiency of the mode switching 
depends on the Hamming ball size m. This effectiveness can be attributed to the fact that the 
Hamming Ball schemes can jointly propose to change up to 2mN bits across all N columns of the 
current X. Furthermore, conditional updates of 9 can be made by marginalizing over a range of 
imitation matrices. For m > iF/2, the efficiency of the Hamming Ball Sampler is therefore close 
to the fully marginalized sampling procedure (m = K = 8) but more computationally tractable 
if the number of mutations is large and exhanstive enumeration is impractical. The conditional 
updates employed by the block Gibbs Sampler requires signihcantly less computational effort but 
the approach is prone to being trapped in single posterior mode and our simulations show that 
it failed to identify the mode corresponding to the linear phylogeny structure (max0 = 0.4). In 
real application this could lead to incorrect scientihc conclnsions and we illnstrate these potential 
impacts on a real cancer data set in SI: C.2. Tumor data analysis. 

2.2 Sparse linear regression analysis 

In this section we consider variable selection problems with sparse linear regression models. In 
this case, the high-dimensional discrete-valued object is a D x 1 binary vector X whose entries 
are 1 when the corresponding covariate in the design matrix Z is associated with the response 
y (and zero otherwise) and 6 consists of the regression and noise parameters (which can be 
analytically marginalized out in our chosen set-up, see Materials and Methods). The interest lies 
in the posterior distribution P(X|Z,y) which would inform us as to which covariates are most 
important for dehning the observed response variable. Typically X is assumed to be sparse so 
that only a few covariates contribute to the explanation of the observations. These sparse linear 
regression models can arise in problems snch as expression quantitative trait loci (eQTL) analysis 
which is concerned with the association of genotype measurements, typically single nucleotide 
polymorphisms (SNPs), with phenotype observations (see e.g. [5] for a review). 

We began by simnlating a regression dataset with Ai = 100 responses and D = 1, 200 covari¬ 
ates in which there were two relevant covariates that fully explain the data while the reminder 
were noisy redundant inputs. These two covariates were chosen to be perfect confounders so 
that only one of them is needed to explain the observed responses. As a consequence this sets up 
a challenging model exploration problem as only two out of possible models represent the 
possible truth. We applied a range of MCMC sampling schemes to sample from the posterior 
distribution over this massive space of possible models including three block Gibbs Samplers 
(BGl-3), which conditionally update blocks of elements of X of size AT = 1 — 3 respectively, 
and three block Hamming Ball samplers (HBl-3) that use blocks of size AT = 10 and consider 
Hamming ball radii of m = 1 — 3 respectively within each block. Note that the block Gibbs 
Samplers are special cases of the block Hamming Ball Sampler where K = m. 

Figure 3 compares the relative performance of the varions sampling schemes. The trace plots 
show the running marginal posterior inclusion probabilities of the two relevant variables xn and 
xgii which converge to the expected values of 0.5 with the Hamming Ball Samplers but not with 
the block Gibbs Samplers. This indicates that the Hamming Ball schemes were mixing well. 
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able to identify the two relevant variables and freqnently switched between their inclnsion. In 
contrast, the block Gibbs Samplers exhibited strong correlation effects (stickiness) that impaired 
their efficiency. 

For snch a high-dimensional problem, the performance of the simplest Hamming Ball Sampler 
(HBl) was particnlarly ontstanding as it nsed the least CPU time and achieved a lower integrated 
antocorrelation time than BGl and BG2. The performance can be explained by the fact that 
the Hamming Ball sampling schemes can handle a large block of variables at a time bnt do not 
reqnire exhanstive ennmeration of all possible latent variable combinations within each block. 
This provides an important compntational saving for sparse problems where most combinations 
will have low probability and the reason why the HBl sampler was particnlarly effective for this 
example. In SI: D.3. Real eQTL analysis we demonstrate the ntility of this methodology for a 
real eQTL example involving 10,000 covariates. 

2.3 Factorial hidden Markov models 

We finally address the application of the Hamming Ball sampling scheme to the factorial hidden 
Markov model (FHMM) [^. The FHMM is an extension of the hidden Markov model (HMM) 
[7] where mnltiple independent hidden chains rnn in parallel and cooperatively generate the 
observed data. The latent matrix X in this case represents a K x N S-valned discrete matrix, 
whose rows corresponds to K hidden Markov chains of length N. Posterior inference in FHMMs 
is extremely challenging since it concerns the compntation of P(X|y, 9) which comprises a fnlly 
dependent distribntion in the space of the possible confignrations. This is an extraordinarily 
large space for even small valnes of K and N. 

Applications of FHMMs therefore freqnently rely npon variational approximations or block 
Gibbs sampling which alternates between sampling a small set of rows of X, conditional on the 
rest rows [^. These Gibbs sampling schemes can easily become trapped in local modes dne the 
conditioning strnctnre which means major strnctnral changes to X are nnlikely to be proposed. 
Joint posterior npdates can be achieved by applying the forward-filtering-backward-sampling 
algorithm (FF-BS) [5], with exhanstive ennmeration, to simnlate a sample from the posterior in 
0{S‘^^N) time. However, althongh the nse of FF-BS is qnite feasible for even very large HMMs, 
it is only practical for very small valnes of K and N in FHMMs. 

We consider an additive FHMM with binary hidden chains which models the observation at 
time i according to y* = ^ki^k + Vi where is a parameter vector that describes the 

contribntion of the k-th featnre when generating yj (given that Xki = 1) and rji is Ganssian 
noise, i.e. r/i ~ AA(0,fT^I); see Material and Methods. Snch a model is nsefnl for applications 
snch as energy disaggregation, where an observed total electricity power at time instant i is 
the snm of individnal powers for all devices that are “on” at that time. We set np a simnlated 
seqnence of length N = 1000 and K = 10 hidden chains (simnlation details can be fonnd in SI: E 
Factorial hidden Markov models). For all sampling schemes, we treat the featnre contribntions 
Wfc as known parameters and we condnct inference on X, that models presence/absence of 
these featnres, together with the noise variance parameter a^. We applied three block Gibbs 
Samplers (BGl-3) and three Hamming Ball-based sampling schemes (HBl-3). As in the tnmor 
deconvolntion example, the Hamming Ball is defined as the set of all matrices X snch that each 
colnmn Xj is at most m bits different from the corresponding colnmn Uj of the anxiliary matrix 
U. We then applied FF-BS within the Hamming Ball to sample from the constrained posterior 
distribntion p(X|U,y). 

Figure 4 shows the utility of the different sampling schemes applied to the FHMM. Glearly, 
the Hamming Ball Samplers, and particularly the schemes HB2-3, are able to escape from local 
modes of the posterior distribution and sample values for X that have much higher posterior 
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probability than values sampled by block Gibbs Samplers. The conhgurations of X sampled 
by the HB2-3 schemes are also close to the true J^true that generated the data. Further, the 
latter algorithms were able to correctly infer the level of the noise variance that generated the 
data while the block Gibbs Samplers and HBl have inferred larger noise variances so that they 
wrongly explain some true signal variation as noise. Finally, we can conclude that HB2 is the 
scheme that best balances computational time and sampling efficiency since, while it has similar 
GPU time with the most advanced block Gibbs Samplers (BG2-3), it allocates computational 
resources very differently from the BG algorithms which results in signihcant improvement of the 
sampling efficiency. In SI: E.3. Energy disaggregation we demonstrate the utility of Hamming 
Ball sampling to a real energy disaggregation data sequence of length 67, 200. 

3 Discussion 

The Hamming Ball Sampler provides a generic sampling scheme for statistical models involving 
high-dimensional discrete latent state-spaces that generalizes and extends conventional block 
Gibbs sampling approaches. In our investigations, we have applied the Hamming Ball sampling 
scheme to three different statistical models and shown benehts over standard Gibbs samplers. 
Importantly, the Hamming Ball Sampler gives the statistical investigator control over the balance 
between statistical efficiency and computational tractability through an intuitive mechanism - 
the specihcation of the Hamming Ball radius and the block design strategy - which is important 
for Big Data applications. Yet, we have also demonstrated that in many problems, basic Ham¬ 
ming Ball samplers {m = 1) that are computationally inexpensive can still give relatively good 
performance compared to standard block Gibbs sampling alternatives. 

Throughout we have provided pure and unrehned Hamming Ball sampler implementations. 
In actual applications, the proposed methodology should not be seen as a single universal method 
for speeding up MGMG but a novel addition to the toolbox that is currently available to us. 
For example, the computations performed within each Hamming Ball update are often trivially 
parallelizable which would allow the user to take advantage of any special hardware for parallel 
computations, such as graphics processing units jOj [TO]. In addition, Hamming Ball sampling 
updates can also be used alongside standard Gibbs sampling updates as well as within parallel 
tempering schemes in Evolutionary Monte Garlo algorithms m- 

Finally, we believe the ideas presented here can have applications in many areas not yet 
explored, such as Bayesian nonparametrics (e.g. in the Indian Buffet Process) and Markov Ran¬ 
dom Fields. Further investigations are being conducted to develop the methodology for these 
statistical models. 

4 Materials and Methods 

4.1 Tumor deconvolution with mixture modelling 

We give a description of the statistical model underlying the tumor deconvolution example in 
the following. We assume that the data y = consists of N pairs of read counts 

where rj corresponds to the number of sequence reads corresponding to the variant allele at 
the Tth locus and di is the total number of sequence reads covering the mutation site. The 
distribution of the variant allele reads given the total read count follows a Binomial distribution 
Tj ~ Binomial(dj, (/)j), i = 1,... ,N, where the variant allele frequency is given by (pi = {1 — 
e)pi + e(l — Pi) and e is a sequence read error rate and Pi = ^ ^k^ki- The parameter 9 is 

a X X 1 vector denoting the proportion of the observed data sample attributed to each of the 
K tumor subpopulations whose genotypes are given hy & K x N binary matrix X. We specify 
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the prior probabilities over 0 as 9k = x*" —j k = 1,... ,K, and 7 ^ ~ Gamma(a/-fC, 1), k = 

Z^i=i 7j 

1.. .. ,K, This hierarchical representation is equivalent to a marginal prior distribution 0|a ~ 
Dirichlet(a/i^,... ,ajK) which induces a sparsity forcing values of 9 to be close to zero when 
a < 1 allowing us to do automatic model selection for the number of tumor sub-populations. 
We further specify the prior probabilities over X as Xki \ fi Bernoulli(xfci, fi), i = l,...,N,k = 

1.. ..,K, and fi\fa,fi3 Beta{fa, fp), i = 1,... ,N. Further details for posterior inference for 
this model and data simulation are given in SI: C Tumor deconvolution with mixture modelling. 

4.2 Sparse linear regression analysis 

Here, we provide a description of the statistical sparse linear regression model used in the 
examples. Suppose a dataset where G IR is the observed response and Zj G 

is the vector of the corresponding covariates. We can collectively store all responses in a X x 
1 vector y (assumed to be normalized to have zero mean) and the covariates m a N x D 
design matrix Z. We further assume that from the total D covariates there exist a small 
unknown subset of relevant covariates that generate the response. This is represented by an 
T)-dimensional unobserved binary vector X that indicates the relevant covariates and follows 
an independent Bernoulli prior distribution, Xd ~ Bernoulli(xd, tto), d = where ttq is 

assigned a conjugate Beta prior, Beta( 7 ro|a 7 ro, and (aTro^^Tro) are hyperparameters. Given 
X, a Gaussian linear regression model takes the form y = Zx(3x T t], rj rsj 
where Zx is the N x Dx design matrix, with Dx = 'f2d=i having columns corresponding 
to Xd = 1 and (3x is the respective Dx x 1 vector of regression coefficients. The regression 
coefficients /3x and the noise variance are assigned a conjugate normal-inverse-gamma prior 
p(/ 3 x)< 7 ^|X) = AA(/3x| 0, g(Zx.^x)~^)IiivGa((T^|ao-, ^(t), where {g,aa,ba-) are hyperparameters. 
Notice that the particular choice g{Z'^Zx)~^ for the covariance matrix , where is g is scalar 
hyperparameter, corresponds to the so called 5 -prior |12) . Based on the above form of the 
prior distributions we can analytically marginalize out the parameters 9 = (tto,/ 3x) <7^) and 
obtain the marginalized joint density [13]: p(y, X|-) oc G (26^ + , where 

G = (1 + g)-^^/^r{Dx + a^,)T{D - Dx + b^,), 5(X) = y^y - Zx{Z]^Zx)-^Z^y 

and r(-) denotes the Gamma function. The hyperparameters of the prior were set to fixed 
values as follows. The hyperparameters of InvGa(cj^|Q;o-, 60 -) were set to = 0.1 and b„ = 0.1 
which leads to a vague prior. The scalar hyperparameter for the 5 -prior were chosen to 5 = X 
as also used in |13| . Finally, the hyperparameters for the Beta prior, Beta( 7 ro|Q; 7 rQ, were 
set to the values Ojro = 0.001 and 67^0 = 1 which favors sparse configurations for the vector X. 
Further details for inference for this model and data simulation are given in SI: D Sparse linear 
regression analysis. 

4.3 Factorial hidden Markov models 

In a typical setting of modelling with FHMMs, the observed sequence y = (yi,---,y 7 v) is 
generated through K binary hidden sequences represented hy a K x N binary matrix X = 
(xi,..., xtv)- The interpretation of the latter binary matrix is that each row encodes for the pres¬ 
ence or absence of a single feature across the observed sequence while each column x, represents 
the different features that are active when generating the observation y^. Different rows of X cor¬ 
respond to independent Markov chains following p{xki\xki-i) = (1 — pf.)^(^ki=^ki-i)^ 
and where the initial state Xki is drawn from a Bernoulli distribution with parameter Vk- Each 
data point y* is generated conditional on Xj through a likelihood model p(yj|xj) parametrized 
by (f). For the additive FHMM this likelihood model takes the form p(yjXj) = AA(yj|wo -|- 
where </> = {wq, ..., cr^} are the parameters. The whole set of model 
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parameters 9 = {/Ofc, determines the joint probability density over (y,X) which is 

written as p(y,X|6») = P(y*|xi)) (U.k=iPixki)Yl^= 2 Pixki\xki-i)^ ■ The Hamming ball 

algorithm follows precisely the iteration in (§- 0 . The step ([^ when implemented by two 
separate Gibbs step reqnires sampling from the posterior conditional p(X|0,U,y) expressed as 
p(X|6',U,y) oc <m)^p(X) where we nsed that I(X G UmiU)) = 

nilil(d(xi,ui) < m). Given that each Xj is restricted to take M valnes in the neighborhood 
of Uj, exact sampling from the above distribntion can be done nsing the FF-BS algorithm in 
0{M‘^N) time. If we implement step ([^ nsing the M-H joint npdate, then we need to evalnate 
the p{9, \J,y) which is simply the normalizing constant of the distribntion p(X|0, U,y) that is 
obtained by the forward pass of the FF-BS algorithm in 0{M‘^N) time. Fnrther details for pos¬ 
terior inference for this model and data simnlation are given in SI: E Factorial hidden Markov 
models. 
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1 Hamming Ball Sampler Illustration. Panel (A) illustrates a Hamming Ball update 

(m = 1) for a 2 X 3 binary matrix to via where the subsets 

(x, u) correspond to columns of the matrix. Panel (B) illustrates a block strategy 
for the application of Hamming Ball sampling when X is a D x 1 vector split into 
random blocks of size K . . m 

2 Tumor deconvolution. (A) Two distinct clonal architectures that lead to the same 
mutant allele frequency vector (p = [0.5,0.3,0.15]'. (B) Trace plots showing the 
sampled values of max(0). (C) Relative computational times for the Hamming 

Ball Sampler for various m (times relative to the block Gibbs Sampler). [14] 

3 Comparison of block Gibbs and Hamming Ball sampling schemes for the sim¬ 

ulation regression example. Top and middle rows give trace plots showing the 
running marginal posterior inclusion probabilities for xn (black) and xgn (blue). 
Bottom row shows CPU times, integrated autocorrelation times (lAT) and effec¬ 
tive sample size (ESS) estimates for each method. [15] 

4 Comparison of block Gibbs and Hamming Ball sampling schemes for the simu¬ 

lation FHMM example. Top row shows CPU times and log joint density values 
during sampling. The bottom row shows the number of elements in X that differ 
from the true matrix ^true that generated the data and the Monte Carlo posterior 
densities over the inferred noise variance (the true value was 0.01). [16] 
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Figure 1: Hamming Ball Sampler Illustration. Panel (A) illustrates a Hamming Ball update 
(m = 1) for a 2 X 3 binary matrix to via where the subsets (x, u) correspond 

to columns of the matrix. Panel (B) illustrates a block strategy for the application of Hamming 
Ball sampling when X is a D x 1 vector split into random blocks of size K. 
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Figure 2: Tumor deconvolution. (A) Two distinct clonal architectures that lead to the same 
mutant allele frequency vector (j) = [0.5,0.3,0.15]'. (B) Trace plots showing the sampled values 
of max(0). (C) Relative computational times for the Hamming Ball Sampler for various m 
(times relative to the block Gibbs Sampler). 
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Figure 3: Comparison of block Gibbs and Hamming Ball sampling schemes for the simulation 
regression example. Top and middle rows give trace plots showing the running marginal posterior 
inclusion probabilities for xn (black) and xgn (blue). Bottom row shows CPU times, integrated 
autocorrelation times (lAT) and effective sample size (ESS) estimates for each method. 
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Figure 4: Comparison of block Gibbs and Hamming Ball sampling schemes for the simulation 
FHMM example. Top row shows CPU times and log joint density values during sampling. The 
bottom row shows the number of elements in X that differ from the true matrix ^true that 
generated the data and the Monte Carlo posterior densities over the inferred noise variance a'^ 
(the true value was 0.01). 
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